Published 2004-11-01 22:59:54

When John released his bindings to HTML Tidy, I joked with him that it would have been far more interesting (as a project) to write a proper HTML lexer rather than bind to an existing library. (Mainly because, having written one in PHP, I didn't think it would be that difficult, and I have a strange idea of fun...)

Well, over the weekend I was re-pondering this, partly because I had used the Flexy parser to try and parse HTML from a web site, and found the tokenizer in Flexy was getting slower with age (5 seconds on average to parse a page). This is not normally a huge issue, as the parse result is cached during the compile phase of the template engine. It is a huge issue if you are pulling pages down, parsing out the forms, and reposting the forms in a web test script.

So over the weekend, after a little Google search-and-discover trip, I ran across a little W3C project, "A Lexical Analyzer for HTML and SGML". It looked interesting, but it wasn't until I pulled the code down, untarred and built it that I realized it could be used to write a really fast and simple HTML tokenizer. (Not only that, it could easily form the basis of a C-based backend for Flexy.)

Creating an extension that used the code (not as a library, just the C code pulled into a PHP extension) and parsed a string of HTML took about 30 minutes. It took an extra 3 hours, on and off over a few days, to make it return an array of tokens (with attributes sorted into a sensible structure).

So now I have a cute extension that has 1 function and 1 result, KISS at its best:

<?php
print_r(
flexyparser_tokenize(
file_get_contents("..some file...")
));

Outputs:


    [0] => Array
        (
            [0] => 14 // token type (look up the source)
            [1] =>    // data (tag name or string)
            [2] => 1  // line number
            [3] => 0  // character position
        )

    [1] => Array
        (
            [0] => 1
            [1] =>

            [2] => 2
            [3] => 50
        )

    [2] => Array
        (
            [0] => 2
            [1] => HTML
            [2] => 2
            [3] => 51
        )

    [3] => Array
        (
            [0] => 2
            [1] => HEAD
            [2] => 2
            [3] => 57
        )
.....
......
     [15] => Array
         (
            [0] => 2
            [1] => A
            [2] => 6
            [3] => 212
            [4] => Array  // array of attributes
                (
                    [HREF] => "/pub/WWW/Consortium/"
                )

        )

    [16] => Array
        (
            [0] => 2
            [1] => IMG
            [2] => 7
            [3] => 243
            [4] => Array
                (
                    [align] => bottom

                    [src] => "/pub/WWW/Icons/WWW/w3c_48x48"
                )

        )
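Since the point of all this was pulling things like forms and links out of a page, here is a rough sketch of walking that token array in plain PHP. The `$tokens` array is hand-written to mimic the output above (a real run would get it from `flexyparser_tokenize()`), `collect_attribute()` is a made-up helper name, and treating type 2 as "start tag" is only what the sample output suggests, so check the source before relying on it.

```php
<?php
// Hand-written sample shaped like the output above; a real run would
// get this array from flexyparser_tokenize().
$tokens = array(
    array(2, 'HTML', 2, 51),
    array(2, 'A', 6, 212, array('HREF' => '"/pub/WWW/Consortium/"')),
    array(2, 'IMG', 7, 243, array(
        'align' => "bottom\n",
        'src'   => '"/pub/WWW/Icons/WWW/w3c_48x48"',
    )),
);

// Assumption: type 2 marks a start tag, as the sample output suggests.
define('TOKEN_START_TAG', 2);

// Hypothetical helper: collect one attribute from every matching start tag.
function collect_attribute($tokens, $tagName, $attrName)
{
    $found = array();
    foreach ($tokens as $token) {
        // Only start tags that carry an attribute array (index 4) matter.
        if ((int) $token[0] !== TOKEN_START_TAG || !isset($token[4])) {
            continue;
        }
        if (strcasecmp($token[1], $tagName) !== 0) {
            continue;
        }
        foreach ($token[4] as $name => $value) {
            if (strcasecmp($name, $attrName) === 0) {
                // Values keep their quotes (and stray whitespace) in the
                // output above, so strip both here.
                $found[] = trim($value, " \t\r\n\"'");
            }
        }
    }
    return $found;
}

print_r(collect_attribute($tokens, 'a', 'href'));
```

Tag and attribute names are compared case-insensitively because the sample shows both `HREF` and `src` coming back in whatever case the page used.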

The code is in my svn server, under akpear/flexyparser, and works perfectly with PHP5 and PHP4 at the moment.

I really want to do a tree version of this that loads data into a user-defined object, e.g.:
<?php
$tree = flexyparser_toTree($data, new MyClass);

so it can be used 'how you want it...'
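Until that exists in C, the same idea can be roughed out in userland on top of the token array. This is only a sketch of how it might work, not the planned extension API: `tokens_to_tree()` is a made-up name, the token type numbers for text (1) and end tags (3) are guesses (only 2 for start tags is visible in the sample output), the list of tags without end tags is deliberately short, and the line and position numbers in the sample tokens are invented. It needs PHP5 for `clone`.

```php
<?php
// Hypothetical token type numbers: 2 for start tags matches the sample
// output; 1 for text and 3 for end tags are guesses, check the source.
define('TOK_TEXT', 1);
define('TOK_START', 2);
define('TOK_END', 3);

// The user-defined node class; any class with these properties would do.
class MyClass
{
    public $tag = '';
    public $attributes = array();
    public $children = array();
}

function tokens_to_tree($tokens, $prototype)
{
    // Tags that never get an end tag (incomplete list).
    $void = array('IMG', 'BR', 'HR', 'INPUT', 'META', 'LINK');

    $root = clone $prototype;
    $root->tag = '#root';
    $stack = array($root);

    foreach ($tokens as $token) {
        $current = $stack[count($stack) - 1];
        if ($token[0] == TOK_START) {
            $node = clone $prototype;
            $node->tag = strtoupper($token[1]);
            $node->attributes = isset($token[4]) ? $token[4] : array();
            $current->children[] = $node;
            if (!in_array($node->tag, $void)) {
                $stack[] = $node; // later tokens become its children
            }
        } elseif ($token[0] == TOK_END) {
            // Pop back to the matching open tag; ignore stray end tags.
            $name = strtoupper($token[1]);
            for ($i = count($stack) - 1; $i > 0; $i--) {
                if ($stack[$i]->tag === $name) {
                    $stack = array_slice($stack, 0, $i);
                    break;
                }
            }
        } elseif ($token[0] == TOK_TEXT) {
            $current->children[] = $token[1];
        }
    }
    return $root;
}

// Invented sample tokens for <P>See <A HREF=...>W3C</A><IMG src=...></P>
$tokens = array(
    array(2, 'P', 1, 0),
    array(1, 'See ', 1, 3),
    array(2, 'A', 1, 7, array('HREF' => '"/pub/WWW/Consortium/"')),
    array(1, 'W3C', 1, 38),
    array(3, 'A', 1, 41),
    array(2, 'IMG', 1, 45, array('src' => '"/pub/WWW/Icons/WWW/w3c_48x48"')),
    array(3, 'P', 1, 85),
);
$tree = tokens_to_tree($tokens, new MyClass);
```

Cloning the object you pass in is one way to get the 'how you want it' part: every node ends up being an instance of your own class.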

